Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus
Authors
Abstract
In this paper we discuss enhancing Arabic passage retrieval for both diacritisized and non-diacritisized text. Most previous work suggested that retrieval start by pre-processing the Arabic text to remove the diacritical marks (short vowels) and so unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. However, searching for a word in diacritisized text requires typing and matching all its diacritical marks, which is cumbersome and prevents users from searching for, and hence retrieving, a valuable amount of text. The alternative is to ignore these marks and accept the resulting ambiguity. In this paper, we propose a passage retrieval approach that searches both diacritic and diacritic-less text by expanding the user's query. For this purpose, we applied a rule-based stemmer and compiled a large thesaurus. We tested our approach on the text of the Quran as an open-domain source of diacritisized text, using a set of 40 non-diacritical words obtained from testers. The results are presented, and the applied approach suggests future directions for search engines.
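To make the trade-off described in the abstract concrete, the following is a minimal Python sketch of the baseline it mentions: stripping diacritical marks so that a query typed without marks can match diacritized passages. It is not the authors' method, which expands the query using a stemmer and a thesaurus. The function names `strip_diacritics` and `find_passages` and the chosen Unicode range are assumptions made for illustration only.

```python
import re

# Assumption: treat U+064B..U+0652 (tanwin, short vowels, shadda, sukun)
# and U+0670 (superscript alef, common in Quranic script) as the marks to drop.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the Arabic diacritical marks above removed."""
    return DIACRITICS.sub("", text)


def find_passages(query: str, passages: list[str]) -> list[str]:
    """Return the original passages whose bare form contains the bare query.

    Both sides are compared without diacritics, so the user does not need
    to type any marks, but words that differ only in their short vowels
    become indistinguishable.
    """
    bare_query = strip_diacritics(query)
    return [p for p in passages if bare_query in strip_diacritics(p)]
```

Matching on the bare form while returning the original diacritized passage keeps the source text intact for display. The ambiguity this introduces at the word level is the cost the abstract describes.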
Similar resources
The Effect of Combining Different Semantic Relations on Arabic Text Classification
A massive amount of documents are being posted online every minute. The task of document classification requires extensive background work on the content of documents, where keyword-based matching alone may not be sufficient. Much research has been carried out in several languages that has revealed significant results. However, Arabic documents still pose a great challenge due to the nature of ...
Combining General Hand-Made and Automatically Constructed Thesauri for Query Expansion in Information Retrieval
One of the most intuitive ideas for enhancing the effectiveness of an information retrieval system is to include the use of a thesaurus. WordNet, as a hand-crafted and general-purpose thesaurus, intuitively should also work fine in information retrieval, but unfortunately, experimental results by many researchers have not been promising. Thereby in this paper we investigate why the use of WordN...
Light Stemming for Arabic Information Retrieval
Computational Morphology is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for ...
The Enhancement of Arabic Stemming by Using Light Stemming and Dictionary-Based Stemming
Word stemming is one of the most important factors that affect the performance of many natural language processing applications such as part of speech tagging, syntactic parsing, machine translation system and information retrieval systems. Computational stemming is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. The existing stemmers hav...
Domain-Specific IR for German, English and Russian Languages
In participating in this domain-specific track, our first objective is to propose and evaluate a light stemmer for the Russian language. Our second objective is to measure the relative merit of various search engines used for the German and to a lesser extent the English languages. To do so we evaluated the tf·idf, Okapi, IR models derived from the Divergence from Randomness (DFR) paradigm, a...